UPSTREAM PR #17453: convert : allow quantizing lora again (#296)
Conversation
Version Insights Pull Request Performance Summary

PR #296: UPSTREAM PR #17453 - Allow Quantizing LoRA Again

Assessment

This PR modifies the Python conversion scripts (…).

Performance Impact: No changes detected. Static analysis shows < 0.001% power consumption variation across all 16 binaries. No functions exhibit measurable response-time or throughput changes between versions. The modifications affect only the model conversion pipeline, not the compiled inference runtime.

Code Changes

1. LoRA Quantization Re-enabled (…) - see the sketch after this list.
2. Default Output Format (…)
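As a rough illustration of the two changes above, a LoRA conversion script that accepts quantized output types again while defaulting to F32 might wire its `--outtype` flag as in the sketch below. The flag name and type choices are assumptions modeled on llama.cpp's conversion scripts, not a copy of the PR's actual diff.

```python
import argparse

def parse_args() -> argparse.Namespace:
    # Hypothetical sketch: re-allow quantized output types for LoRA
    # conversion, but keep F32 as the conservative default.
    parser = argparse.ArgumentParser(description="Convert a LoRA adapter to GGUF")
    parser.add_argument(
        "--outtype",
        choices=["f32", "f16", "bf16", "q8_0", "auto"],
        default="f32",  # default stays F32, as the PR description states
        help="output format; quantized types are allowed again for LoRA",
    )
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(f"converting LoRA adapter with outtype={args.outtype}")
```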
Impact Analysis

Runtime Performance: None. These are conversion-time scripts, not part of the compiled binaries analyzed. The inference engine (…) is unaffected.

Binary Analysis: All 16 binaries (including …) show no measurable differences.

User Impact: …

Correctness: The logic is sound. The slice notation …

Conclusion

No performance-related concerns. The changes are limited to conversion tooling and have no runtime impact. The PR successfully restores LoRA quantization while keeping conservative defaults.
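For context on the correctness note: quantized GGUF types such as Q8_0 pack values in fixed-size blocks of 32, so conversion code typically guards quantization behind a shape check and falls back to a float type otherwise. Below is a minimal sketch of such a guard; the helper name and fallback policy are assumptions for illustration, not the PR's code.

```python
import numpy as np

Q8_0_BLOCK_SIZE = 32  # Q8_0 quantizes values in blocks of 32

def choose_tensor_type(data: np.ndarray, requested: str) -> str:
    # Hypothetical guard: only quantize when the row length divides
    # evenly into quantization blocks; otherwise fall back to f16.
    if requested == "q8_0" and data.shape[-1] % Q8_0_BLOCK_SIZE == 0:
        return "q8_0"
    return "f16" if requested != "f32" else "f32"

# Example: a (16, 48) tensor cannot be Q8_0-quantized (48 % 32 != 0)
print(choose_tensor_type(np.zeros((16, 48), dtype=np.float32), "q8_0"))  # f16
print(choose_tensor_type(np.zeros((16, 64), dtype=np.float32), "q8_0"))  # q8_0
```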
Mirrored from ggml-org/llama.cpp#17453
Allow quantizing LoRA adapters at conversion time again, but default to F32 (which has been the de facto norm since #8980 inadvertently forced it).
Fixes #17447
Fixes #10671
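As a usage illustration, assuming llama.cpp's convert_lora_to_gguf.py script and its --base/--outtype flags (the paths below are placeholders):

```sh
# Quantize a LoRA adapter to Q8_0 at conversion time
python convert_lora_to_gguf.py ./my-lora-adapter \
    --base ./base-model \
    --outtype q8_0

# Omitting --outtype keeps the F32 default described above
python convert_lora_to_gguf.py ./my-lora-adapter --base ./base-model
```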